problem. Ensemble learning is a method used in clustering; multiple runs are
executed that produce different results for the same data set. Then the final

1. INTRODUCTION

Clustering is a popular exploratory data analysis tool for gaining an understanding of data
structure. It is the task of identifying subgroups (clusters) in data so that data points within the same
cluster are very similar while data points in different clusters are very dissimilar. In other words, we seek
homogeneous subgroups such that the data points in each cluster are as similar as possible according to a
similarity measure such as Euclidean distance or correlation-based distance [1], [2]. The critical questions
in clustering are: which similarity metric should be used, how many clusters are present in the data, which
clustering method is the “best”, how should algorithmic parameters be chosen, and are the individual
clusters and partitions valid [3].

K-means is one of the most widely used clustering algorithms because of its speed and simplicity [4]. It
has been used in different fields [5], [6]. It is an iterative technique that attempts to split a dataset into k
separate non-overlapping subgroups (clusters) [7], such that each data point belongs to only one cluster. It
attempts to make intra-cluster data points as similar as possible while keeping clusters as distinct (far
apart) as possible. It assigns data points to clusters in such a way that the sum of the squared distances
between the data points and their cluster’s centroid (the arithmetic mean of all the data points in that
cluster) is as small as possible [8].
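
As an illustration of this objective, the following minimal sketch computes the within-cluster sum of squared distances for a given assignment. The function names and the toy points are assumptions made for the example, not part of the original work.

```python
from collections import defaultdict


def centroids(points, labels):
    """Arithmetic mean of the points assigned to each cluster label."""
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)
    return {
        label: tuple(sum(coord) / len(members) for coord in zip(*members))
        for label, members in groups.items()
    }


def within_cluster_sse(points, labels):
    """Sum of squared Euclidean distances from each point to its centroid."""
    cents = centroids(points, labels)
    return sum(
        sum((x - c) ** 2 for x, c in zip(point, cents[label]))
        for point, label in zip(points, labels)
    )


# Four toy points at the corners of a wide rectangle (hypothetical data).
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = within_cluster_sse(points, [0, 0, 1, 1])  # left pair vs. right pair
bad = within_cluster_sse(points, [0, 1, 0, 1])   # bottom pair vs. top pair
print(good, bad)
```

The left/right assignment gives a much smaller value than the bottom/top assignment, which is the sense in which K-means prefers compact clusters.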

The less variance there is within clusters, the more homogeneous (similar) the data points are. If
the clusters have a spherical-like shape, the K-means method is good at capturing the data structure, because
it always tries to build a spherical shape around the centroid. Consequently, when the clusters have complex
geometric shapes, K-means fails to cluster the data well [9]. In addition, it is necessary to predefine the
number of clusters (k), the method cannot deal with noisy data or outliers, and it is not suited to detecting
clusters with non-convex shapes [1], [8]. Finally, the outcome depends on the initial centroids.

A clustering ensemble tries to integrate numerous clustering models to provide a better outcome, in
terms of consistency and quality, than the individual clustering algorithms [10], [11]. It refers to a situation
in which a number of different runs produce different clusterings for a particular dataset, and a single
(consensus) clustering is then sought [12]. Most existing ensemble methods try to obtain the clustering result
that is most consistent with the base clusterings, since “accuracy” in clustering does not have a clear
meaning because it is unsupervised [13]. The term “three-way decision” refers to a group of efficient
methods and heuristics employed in human problem solving and information processing. Three-way
clustering, an application of three-way decision to clustering, employs a core region and a peripheral
(fringe) region to represent a cluster [10], [14], [15]. The core region provides a pure clustering of objects
and can therefore be used to improve the clustering. It is proposed here to merge it with the K-means
algorithm in order to improve the algorithm and reduce its sensitivity to random initial centroids. This
hypothesis is evaluated in this paper through experiments.


2. METHOD 

The work in this paper is based on two families of methods: traditional clustering with the k-means
algorithm and ensemble clustering. The proposed work combines them in order to achieve better
performance.


2.1. K-means algorithm 

Clustering is the unsupervised classification of patterns into groups (clusters) [16]. The most
well-known and often used clustering technique is the k-means algorithm, and several extensions of it have
been proposed in the literature. Although k-means is regarded as an unsupervised learning approach to
clustering in pattern recognition and machine learning, the technique and its extensions are always affected
by initialization and require the number of clusters a priori [17]. In other words, the k-means algorithm is
not quite an unsupervised clustering technique [1], [17]. Despite its widespread use, the algorithm has
certain drawbacks. These include issues with randomly initialized centroids, which can result in unexpected
convergence [1], [18]. Therefore, when the algorithm is run multiple times, different clustering results can
be obtained each time, depending on the initial centroids. Different solutions have been proposed to solve
these problems [18], [19].
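
The dependence on the initial centroids can be seen in a small sketch of the standard k-means iteration. The function `kmeans`, the explicitly supplied initial centroids, and the toy points are assumptions chosen to make the example deterministic; in practice the initial centroids are usually drawn at random.

```python
def kmeans(points, init_centroids, max_iter=100):
    """Plain k-means iteration starting from the given initial centroids."""
    cents = [tuple(c) for c in init_centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(
                range(len(cents)),
                key=lambda j: sum((x - c) ** 2 for x, c in zip(p, cents[j])),
            )
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(len(cents)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster becomes empty
                cents[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, cents


# Hypothetical toy data: the corners of a wide rectangle.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels_a, _ = kmeans(points, [(0.0, 1.0), (10.0, 1.0)])
labels_b, _ = kmeans(points, [(5.0, 0.0), (5.0, 2.0)])
print(labels_a, labels_b)
```

Both runs converge, but to different partitions of the same data: the first separates left from right, while the second stays in a poorer solution that separates bottom from top.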


2.2. Cluster ensemble 

Cluster ensemble techniques seek to develop stronger and more resilient clustering solutions by
combining information from several data partitionings [20]. In other words, they seek to integrate various
clustering models in order to create a superior outcome [18]. The ensemble technique was initially developed
and extensively researched in supervised learning domains. Because of its effectiveness in classification
problems, researchers have sought over the last decade or so to adapt the same paradigm to unsupervised
learning, specifically to clustering, for two reasons [11]: i) there is usually no prior information about the
underlying structure or any specific features that we wish to uncover, and because each algorithm forces a
certain structure onto the data, various clustering algorithms might generate different clustering results for
the same data; ii) there is no one clustering method that works consistently well for various problems, and
there are no clear rules to follow when choosing a clustering algorithm for a specific problem.


2.3. Three-way method 

Hard clustering uses a two-way decision to produce a cluster: an object either belongs to the cluster
or does not. Uncertain situations, however, need a richer representation. The three-way method is based on
three decisions and so gives more than a single region per cluster [21]. Three-way decision states that
“according to the positive, boundary, and negative regions of a set, one can make a three-way decision:
accept, abstain and reject” [22]. Accordingly, it can be regarded as a family of efficient methods and
heuristics widely used for resolving and processing decision-making problems [22]. Some basic facts about
three-way clustering follow. Suppose that $C = \{C_1, \ldots, C_k\}$ is a family of clusters of the universe
$V = \{v_1, \ldots, v_n\}$. A three-way cluster $C_i$ is represented by a pair of sets [21]:


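$$C_i = (\mathrm{Co}(C_i), \mathrm{Fr}(C_i)) \qquad (1)$$

where $\mathrm{Co}(C_i) \subset V$, $\mathrm{Fr}(C_i) \subset V$ and
$\mathrm{Tr}(C_i) = V - (\mathrm{Co}(C_i) \cup \mathrm{Fr}(C_i))$. The sets $\mathrm{Co}(C_i)$,
$\mathrm{Fr}(C_i)$ and $\mathrm{Tr}(C_i)$ represent the core region, fringe region and trash region,
respectively [21]. The outcome of three-way clustering will be:

$$C = \{(\mathrm{Co}(C_1), \mathrm{Fr}(C_1)), (\mathrm{Co}(C_2), \mathrm{Fr}(C_2)), \ldots, (\mathrm{Co}(C_k), \mathrm{Fr}(C_k))\}$$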


Then, a modified three-way decision clustering algorithm is applied using the k-means algorithm,
according to the following steps:
a. Execute the original k-means algorithm multiple times (m runs).
b. Select the best performance and compute the average performance using the Davies-Bouldin index (DB
hereafter), the Average Silhouette index (AS hereafter) and Accuracy (ACC hereafter) [23]-[25].
c. Derive the core region and the fringe region as:

$$\mathrm{Co}(C_j) = \{v \mid \forall i = 1, \ldots, m,\ v \in C_{ij}\} = \bigcap_{i=1}^{m} C_{ij}$$

$$\mathrm{Fr}(C_j) = \{v \mid \exists i \neq p,\ i, p = 1, \ldots, m,\ v \in C_{ij} \wedge v \notin C_{pj}\} = \bigcup_{i=1}^{m} C_{ij} - \bigcap_{i=1}^{m} C_{ij}$$

where $m$ is the number of k-means runs and $C_{ij}$ denotes cluster $j$ in the result of run $i$, after the
cluster labels of the runs have been aligned.
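
The core and fringe regions of a single cluster reduce to a set intersection and a set difference. The sketch below assumes that the cluster labels of the runs have already been aligned; the function name and the object identifiers are illustrative only.

```python
def core_and_fringe(aligned_clusters):
    """Core and fringe of one cluster, given its member set in each run.

    aligned_clusters: list of sets, one per k-means run, all describing the
    same (label-aligned) cluster j.
    """
    core = set.intersection(*aligned_clusters)      # in cluster j in every run
    fringe = set.union(*aligned_clusters) - core    # in cluster j in some runs only
    return core, fringe


# Hypothetical members of cluster j in three runs (object identifiers).
runs = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3}]
core, fringe = core_and_fringe(runs)
print(core, fringe)
```

Objects 1, 2 and 3 are assigned to the cluster in every run and form the core; objects 4 and 5 appear only in some runs and form the fringe.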


2.4. Measures of evaluation 

Clustering assessment, also known as cluster validity, is a key procedure for assessing how well a
learning technique finds meaningful groupings. A good cluster quality measure helps to compare different
clustering methods and to judge whether one approach is preferable to another [21]. To evaluate the
performance of the algorithms, we used:
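a. Davies-Bouldin index [24], [25] (DB hereafter)

$$\mathrm{DB} = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \left\{ \frac{S(C_i) + S(C_j)}{d(x_i, x_j)} \right\} \qquad (2)$$

where, in the standard definition of the index, $S(C_i)$ is the average distance of the objects in cluster
$C_i$ to its centroid $x_i$ and $d(x_i, x_j)$ is the distance between the centroids of clusters $C_i$ and
$C_j$. A lower value is better.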
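b. Average Silhouette index [22] (AS hereafter)

$$\mathrm{AS} = \frac{1}{n} \sum_{i=1}^{n} S_i \qquad (3)$$

where $S_i$ is the silhouette value of object $v_i$ and $n$ is the number of objects. A higher value is better.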

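c. Accuracy (ACC hereafter)

$$\mathrm{ACC} = \sum_{i=1}^{k} \frac{n_i}{n} \qquad (4)$$

where $n_i$ is the number of objects correctly assigned to cluster $C_i$ and $n$ is the total number of
objects. A higher value is better.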
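
The two internal indices can be sketched in a few lines. The sketch follows the standard textbook definitions of the Davies-Bouldin and silhouette indices with Euclidean distance, which is an assumption about the variants used here; the function names and the toy data are illustrative, and at least two clusters with distinct centroids are assumed.

```python
import math


def davies_bouldin(points, labels):
    """DB index: mean over clusters of the worst (S_i + S_j) / d(x_i, x_j)."""
    ids = sorted(set(labels))
    groups = {c: [p for p, lab in zip(points, labels) if lab == c] for c in ids}
    cents = {c: tuple(sum(x) / len(g) for x in zip(*g)) for c, g in groups.items()}
    scatter = {
        c: sum(math.dist(p, cents[c]) for p in g) / len(g) for c, g in groups.items()
    }
    worst = [
        max((scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
            for j in ids if j != i)
        for i in ids
    ]
    return sum(worst) / len(ids)


def average_silhouette(points, labels):
    """AS index: mean of (b - a) / max(a, b) over all objects."""
    ids = sorted(set(labels))
    groups = {c: [i for i, lab in enumerate(labels) if lab == c] for c in ids}
    total = 0.0
    for i, p in enumerate(points):
        own = [j for j in groups[labels[i]] if j != i]
        if not own:
            continue  # singleton cluster: silhouette is taken as 0
        a = sum(math.dist(p, points[j]) for j in own) / len(own)
        b = min(
            sum(math.dist(p, points[j]) for j in groups[c]) / len(groups[c])
            for c in ids if c != labels[i]
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Hypothetical toy data: the corners of a wide rectangle.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
db_good = davies_bouldin(points, [0, 0, 1, 1])
db_bad = davies_bouldin(points, [0, 1, 0, 1])
as_good = average_silhouette(points, [0, 0, 1, 1])
as_bad = average_silhouette(points, [0, 1, 0, 1])
print(db_good, db_bad, as_good, as_bad)
```

On this toy data the compact left/right partition obtains the lower DB value and the higher AS value, in line with the direction of each index stated above.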


3. PROPOSED ALGORITHM 

The proposed algorithm, shown in Algorithm 1, is based on merging the three-way technique with the
K-means algorithm. This is done in several steps. First, traditional k-means clustering is run multiple (m)
times with different initial centroids. Because each run is given new initial centroids, different results are
produced. As a result, there are m different clusterings, and each object in the data is a member of m
clusters. These clusterings are then passed to the three-way ensemble technique, which intersects each
object's clusters from the different runs to construct the core region, which contains the purely clustered
objects, and the fringe region, which contains the other objects, as shown in Figure 1.


Algorithm 1: Proposed algorithm
Input: m K-means clustering results (C_1, C_2, ..., C_m)
Output: Three-way ensemble re-clustering result
        C = {(Co(C_1), Fr(C_1)), (Co(C_2), Fr(C_2)), ..., (Co(C_k), Fr(C_k))}
for each C_i in {C_i}, i = 2, ..., m do
    for j = 0 to k-1 do
        get cluster j+1 from C_1
        for p = 0 to k-1 do
            get cluster p+1 from C_i
            overlap(j, p) = Count(C_1j, C_ip)
            // overlap is a k x k matrix
            // Count(C_1j, C_ip) counts the elements common to C_1j and C_ip
        end for
    end for
    switch-tab = ∅    // switch-tab is a k x 2 matrix
    for n = 0 to k-1 do
        (u, v) = argmax(overlap(j, p))    // (u, v) is the position of the biggest element
        switch-tab(n, 0) = v+1
        switch-tab(n, 1) = u+1
        delete overlap(u, *)
        delete overlap(*, v)
    end for
    for each label in C_i equal to v (from switch-tab), replace it with u (from switch-tab)
end for
for j = 1 to k do
    calculate Co(C_j) = ∩_{i=1..m} C_ij
    calculate Fr(C_j) = ∪_{i=1..m} C_ij − ∩_{i=1..m} C_ij
end for
return C = {(Co(C_1), Fr(C_1)), (Co(C_2), Fr(C_2)), ..., (Co(C_k), Fr(C_k))}


Figure 1. Proposed algorithm 
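
A minimal Python sketch of the procedure is given below, under the assumption that each run is represented as a list of 0-based cluster labels, one per object. The names `align_labels` and `three_way_ensemble` are hypothetical, and masking matched rows and columns stands in for the deletion step of the overlap matrix.

```python
def align_labels(reference, labels, k):
    """Relabel `labels` so that its clusters match those of `reference`.

    Greedy matching on the k x k overlap matrix: repeatedly take the largest
    remaining entry (u, v) and map label v of this run to reference label u.
    """
    overlap = [[0] * k for _ in range(k)]
    for ref_label, label in zip(reference, labels):
        overlap[ref_label][label] += 1
    mapping = {}
    free_rows, free_cols = set(range(k)), set(range(k))
    for _ in range(k):
        u, v = max(
            ((r, c) for r in sorted(free_rows) for c in sorted(free_cols)),
            key=lambda rc: overlap[rc[0]][rc[1]],
        )
        mapping[v] = u
        free_rows.discard(u)  # equivalent to deleting row u
        free_cols.discard(v)  # equivalent to deleting column v
    return [mapping[label] for label in labels]


def three_way_ensemble(runs, k):
    """Return [(core_j, fringe_j)] for j = 0..k-1 from m label lists."""
    reference = runs[0]
    aligned = [reference] + [align_labels(reference, run, k) for run in runs[1:]]
    result = []
    for j in range(k):
        member_sets = [
            {idx for idx, label in enumerate(run) if label == j} for run in aligned
        ]
        core = set.intersection(*member_sets)
        fringe = set.union(*member_sets) - core
        result.append((core, fringe))
    return result


# Three hypothetical runs over six objects; run 2 has swapped label names
# and run 3 places object 2 in the other cluster.
runs = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1, 1],
]
result = three_way_ensemble(runs, k=2)
print(result)
```

Object 2 is the only one on which the runs disagree after alignment, so it lands in the fringe of both clusters, while the remaining objects form the two cores.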


4. RESULTS AND DISCUSSION 

The practical tests in this paper were carried out on popular datasets taken from the "UCI
machine learning repository" site. The details of these datasets (samples, attributes, and classes) are
listed in Table 1; they are used for the clustering task. The proposed algorithm was tested by running both
the traditional k-means algorithm and the ensemble k-means algorithm and then computing different
metrics for each one.


Table 1. Experiments' datasets details
ID  Dataset  Samples  Attributes  Classes
1   Bank     1372     4           2
2   Forest   325      27          4
3   Seeds    210      7           3
4   Sonar    208      60          2
5   Wine     178      4           2


Each dataset was processed with the traditional k-means algorithm and the ensemble k-means
algorithm. For each dataset, three experiments were carried out to enable comparison between traditional
k-means and ensemble k-means by computing the metrics (DB, AS, ACC) in each experiment: the best
k-means performance, the average k-means performance, and the performance of ensemble k-means.
Tables 2-4 show that, in most cases, the performance of the core region improves on the best and average
performance of the traditional K-means algorithm: DB values are lower while AS and ACC values are
higher. This is due to the exclusion of the elements in the fringe region. By synchronizing the results of the
runs, matching the names of the clusters by unifying the cluster labels, and intersecting the clusters, the
most closely related objects in each cluster were identified (core region), and the marginal elements that
usually lie on the cluster boundary were isolated (fringe region). Excluding the marginal objects made it
clear that the results could be improved.


Table 2. Bank and Forest datasets performances
                                Bank                                   Forest
Experiment type                 DB           AS           ACC          DB           AS           ACC
Best K-means performance        0.453454718  0.000854869  0.729591837  0.060172805  0.006712123  0.710769231
Average K-means performance     0.454606287  0.00085354   0.726530612  0.082387268  0.005659233  0.603384615
Ensemble K-means (core region)  0.451699693  0.000861777  0.728205128  0.037559036  0.017986058  0.793548387


Table 3. Seeds and Sonar datasets performances
                                Seeds                                  Sonar
Experiment type                 DB           AS           ACC          DB           AS           ACC
Best K-means performance        0.157134501  0.009011163  0.938095238  0.014026729  0.000343687  0.581730769
Average K-means performance     0.159051546  0.008910751  0.921428571  0.014120872  0.00021696   0.552884615
Ensemble K-means (core region)  0.147860874  0.009624134  0.94         0.013599365  0.00029952   0.553846154


Table 4. Wine dataset performances
                                Wine
Experiment type                 DB           AS           ACC
Best K-means performance        0.157134501  0.009011163  0.938095238
Average K-means performance     0.159051546  0.008910751  0.921428571
Ensemble K-means (core region)  0.147860874  0.009624134  0.94


5. CONCLUSION 

We applied the three-way clustering re-ensemble method, after modifying its algorithm, to improve
the results obtained from the K-means algorithm over several runs. The results produced by ensemble
K-means show promising performance. This is a good step towards further related work: the method could
be extended by recomputing the centroids and then assigning new incoming elements to clusters by
measuring the distance between the new elements and the generated centroids, without the need to repeat
the whole process.